It is widely believed that the food industry produces a large share of highly processed foods. A comprehensive dataset allows this belief to be examined directly: it covers roughly 50,000 products, with a detailed inventory of their ingredients and corresponding values for sugar, protein, and fat content.
By meticulously cataloging ingredients and nutritional information, the dataset reveals the composition of a vast range of products and the balance of sugar, protein, and fat within them. This level of granularity lets researchers, policymakers, and the public make informed dietary decisions and better understand the impact of processed foods on overall health.
With this dataset, stakeholders can identify patterns, trends, and outliers among processed foods. That information is particularly relevant to public health, nutritional education, and regulatory measures aimed at a healthier food landscape. As society grapples with diet-related diseases and obesity, the insights derived here contribute to the ongoing discussion of the food industry's role in shaping our collective nutritional habits.
In essence, the dataset offers a transparent, data-driven view of processed foods that goes beyond general belief. It enables informed conversations about the nature of our food supply and a more nuanced understanding of the relationship between the food industry and the products on our shelves.
@misc{GroceryDB,
  title  = {GroceryDB: Prevalence of Processed Food in Grocery Stores},
  author = {Babak Ravandi and Peter Mehler and Albert-László Barabási and Giulia Menichetti},
  year   = {2022},
  eprint = {medRxiv:2022.04.23.22274217},
  url    = {https://www.medrxiv.org/content/10.1101/2022.04.23.22274217}
}
The goal of this study is to determine the extent to which products are highly processed and to identify the key parameters that characterize a highly processed product.
# Data Analysis & Data Wrangling
import numpy as np
import pandas as pd
import os
# import missingno as mn
from random import sample
from numpy.random import uniform
from math import isnan
# Visualization
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
%matplotlib inline
# Plotly Libraries
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
import plotly.offline as pyo
from plotly import tools
from plotly.colors import n_colors
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot
from IPython.display import display, HTML
init_notebook_mode(connected=True)
# ML Libraries
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split # Scikit-Learn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree
import warnings
warnings.filterwarnings('ignore')
# Local paths (set these for your environment)
path = ""            # directory containing GroceryDB_foods.csv
save_cvs_path = ""   # directory for exported CSV files
save_image_path = "" # directory for exported images
# import the primary dataset
df_1 = pd.read_csv(os.path.join(path,'GroceryDB_foods.csv'))
df_1.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50468 entries, 0 to 50467 Data columns (total 26 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 50468 non-null int64 1 original_ID 50468 non-null object 2 name 50467 non-null object 3 store 50468 non-null object 4 harmonized single category 50468 non-null object 5 brand 50078 non-null object 6 f_FPro 26657 non-null float64 7 f_FPro_P 26657 non-null object 8 f_min_FPro 26387 non-null float64 9 f_std_FPro 26387 non-null float64 10 f_FPro_class 26657 non-null float64 11 ingredientList 50468 non-null object 12 has10_nuts 50468 non-null int64 13 is_Nuts_Converted_100g 50468 non-null int64 14 Protein 34515 non-null float64 15 Total Fat 34645 non-null float64 16 Carbohydrate 34633 non-null float64 17 Sugars, total 32370 non-null float64 18 Fiber, total dietary 28945 non-null float64 19 Calcium 28504 non-null float64 20 Iron 28286 non-null float64 21 Sodium 34463 non-null float64 22 Vitamin C 14534 non-null float64 23 Cholesterol 29476 non-null float64 24 Fatty acids, total saturated 29810 non-null float64 25 Total Vitamin A 14023 non-null float64 dtypes: float64(16), int64(3), object(7) memory usage: 10.0+ MB
df_1.isnull().sum()
Unnamed: 0 0 original_ID 0 name 1 store 0 harmonized single category 0 brand 390 f_FPro 23811 f_FPro_P 23811 f_min_FPro 24081 f_std_FPro 24081 f_FPro_class 23811 ingredientList 0 has10_nuts 0 is_Nuts_Converted_100g 0 Protein 15953 Total Fat 15823 Carbohydrate 15835 Sugars, total 18098 Fiber, total dietary 21523 Calcium 21964 Iron 22182 Sodium 16005 Vitamin C 35934 Cholesterol 20992 Fatty acids, total saturated 20658 Total Vitamin A 36445 dtype: int64
# delete rows with missing 'brand' or 'name'
df_1 = df_1.dropna(subset=['brand', 'name'])
df_1 = df_1.rename(columns={'Unnamed: 0': 'index',
"Sugars, total": "sugar_total",
"harmonized single category" : "category",
"name" : "product",
"ingredientList" : "ingredients"
})
# Specify the column(s) to delete
columns_to_delete = ["index","f_FPro_P"]
# Create a new DataFrame by dropping the specified columns
df_cleaned = df_1.drop(columns=columns_to_delete)
df_cleaned.info()
<class 'pandas.core.frame.DataFrame'> Index: 50077 entries, 0 to 50467 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 original_ID 50077 non-null object 1 product 50077 non-null object 2 store 50077 non-null object 3 category 50077 non-null object 4 brand 50077 non-null object 5 f_FPro 26588 non-null float64 6 f_min_FPro 26318 non-null float64 7 f_std_FPro 26318 non-null float64 8 f_FPro_class 26588 non-null float64 9 ingredients 50077 non-null object 10 has10_nuts 50077 non-null int64 11 is_Nuts_Converted_100g 50077 non-null int64 12 Protein 34400 non-null float64 13 Total Fat 34529 non-null float64 14 Carbohydrate 34518 non-null float64 15 sugar_total 32265 non-null float64 16 Fiber, total dietary 28874 non-null float64 17 Calcium 28428 non-null float64 18 Iron 28206 non-null float64 19 Sodium 34348 non-null float64 20 Vitamin C 14488 non-null float64 21 Cholesterol 29397 non-null float64 22 Fatty acids, total saturated 29735 non-null float64 23 Total Vitamin A 13981 non-null float64 dtypes: float64(16), int64(2), object(6) memory usage: 9.6+ MB
df_cleaned.head()
| original_ID | product | store | category | brand | f_FPro | f_min_FPro | f_std_FPro | f_FPro_class | ingredients | ... | Carbohydrate | sugar_total | Fiber, total dietary | Calcium | Iron | Sodium | Vitamin C | Cholesterol | Fatty acids, total saturated | Total Vitamin A | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | wf_green-mountain-creamery-yoyummy-yogurt-sque... | Yoyummy Yogurt Squeeze Mixed Berry, 3.5 count | WholeFoods | baby-food | Green Mountain Creamery™ | 0.897833 | 0.882500 | 0.011457 | 3.0 | ['Cultured Pasteurized Grade A Lowfat Milk', '... | ... | 15.000000 | 14.000000 | 0.000000 | 0.150000 | 0.000000 | 0.060000 | 0.001200 | 0.005000 | 1.000000 | 0.02325 |
| 1 | wf_stonyfield-yobaby-whole-milk-baby-yogurt-po... | YoBaby Whole Milk Baby Yogurt Pouches, Peach &... | WholeFoods | baby-food | Stonyfield | 0.829028 | 0.812500 | 0.012965 | 3.0 | ['Our Family Recipe: Cultured Pasteurized Orga... | ... | 12.500000 | 9.375000 | 0.000000 | 0.189583 | 0.000000 | 0.078125 | NaN | 0.010417 | 2.083333 | NaN |
| 2 | wf_stonyfield-yobaby-whole-milk-baby-yogurt-po... | YoBaby Whole Milk Baby Yogurt Pouches, Banana ... | WholeFoods | baby-food | Stonyfield | 0.829028 | 0.812500 | 0.012965 | 3.0 | ['Cultured Pasteurized Organic Whole Milk', 'O... | ... | 12.500000 | 9.375000 | 0.000000 | 0.189583 | 0.000000 | 0.078125 | NaN | 0.010417 | 2.083333 | NaN |
| 3 | wf_earths-best-stage-2-organic-pears-baby-food... | Stage 2 Organic Pears Baby Food, 4 oz | WholeFoods | baby-food | Earth's Best | NaN | NaN | NaN | NaN | ['Organic Pears', 'Water', 'Ascorbic Acid (Vit... | ... | 12.389381 | 10.619469 | 1.769912 | 0.010619 | 0.000265 | 0.008850 | 0.014159 | NaN | NaN | 0.00000 |
| 4 | wf_kite-hill-strawberry-banana-yogurt-tubes-61... | Strawberry Banana Yogurt Tubes, 14 oz | WholeFoods | baby-food | Kite Hill | 0.599306 | 0.586944 | 0.008600 | 3.0 | ['Almond Milk (Water, Almonds)', 'Cane Sugar',... | ... | 16.000000 | 12.000000 | 2.000000 | 0.042000 | 0.000000 | 0.000000 | NaN | 0.000000 | 0.000000 | NaN |
5 rows × 24 columns
df_cleaned.describe()
| f_FPro | f_min_FPro | f_std_FPro | f_FPro_class | has10_nuts | is_Nuts_Converted_100g | Protein | Total Fat | Carbohydrate | sugar_total | Fiber, total dietary | Calcium | Iron | Sodium | Vitamin C | Cholesterol | Fatty acids, total saturated | Total Vitamin A | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 26588.000000 | 26318.000000 | 26318.000000 | 26588.000000 | 50077.000000 | 50077.000000 | 34400.000000 | 34529.000000 | 34518.000000 | 32265.000000 | 28874.000000 | 28428.000000 | 28206.000000 | 34348.000000 | 14488.000000 | 2.939700e+04 | 29735.000000 | 13981.000000 |
| mean | 0.728199 | 0.709464 | 0.015904 | 2.489356 | 0.530962 | 0.671526 | 27.061769 | 23.662984 | 67.249194 | 40.490996 | 14.115376 | 0.199089 | 0.002698 | 18.870239 | 0.023739 | 1.191841e+02 | 14.481413 | 0.006777 |
| std | 0.228380 | 0.234277 | 0.014627 | 1.019942 | 0.499045 | 0.495162 | 547.809048 | 284.452198 | 542.003023 | 896.208078 | 476.307004 | 1.676704 | 0.012914 | 2956.528612 | 0.261092 | 2.041346e+04 | 309.050654 | 0.112418 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -0.000212 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 |
| 25% | 0.612243 | 0.593889 | 0.008153 | 3.000000 | 0.000000 | 0.000000 | 0.416667 | 0.000000 | 5.555556 | 0.416667 | 0.000000 | 0.000000 | 0.000000 | 0.030000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 |
| 50% | 0.783556 | 0.763056 | 0.013470 | 3.000000 | 1.000000 | 1.000000 | 5.263158 | 5.555556 | 21.164394 | 4.200000 | 1.538462 | 0.035714 | 0.001071 | 0.250000 | 0.000000 | 0.000000e+00 | 1.960784 | 0.000000 |
| 75% | 0.908389 | 0.893889 | 0.020103 | 3.000000 | 1.000000 | 1.000000 | 10.869565 | 18.812795 | 57.142857 | 18.333333 | 3.625000 | 0.115044 | 0.002540 | 0.546747 | 0.006294 | 2.099237e-02 | 7.017544 | 0.000113 |
| max | 0.999056 | 0.998611 | 0.222207 | 3.000000 | 1.000000 | 1.000000 | 44000.000000 | 36000.000000 | 36000.000000 | 99099.099099 | 64000.000000 | 91.000000 | 1.071429 | 547619.047619 | 12.600000 | 3.500000e+06 | 33333.333333 | 11.627907 |
# Set a threshold for each column based on the maximum reasonable value
thresholds = {
'Protein': 100,
'Total Fat': 200,
'Carbohydrate': 500,
'sugar_total': 500,
'Fiber, total dietary': 50,
'Sodium': 200,
'Fatty acids, total saturated': 200,
# Add other columns as needed
}
# Identify and remove rows with values above the threshold
for col, threshold in thresholds.items():
df_cleaned = df_cleaned[df_cleaned[col] <= threshold]
# Check the shape of the cleaned DataFrame
print("Rows after removing outliers:", df_cleaned.shape[0])
Rows after removing outliers: 27543
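The loop above re-filters the frame once per column; the same result can be expressed as a single combined mask, which also makes explicit that rows with missing nutrient values are dropped (a NaN comparison evaluates to False). A minimal sketch on toy data, with hypothetical values:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_cleaned (hypothetical values)
df = pd.DataFrame({
    'Protein': [10.0, 44000.0, np.nan],
    'sugar_total': [5.0, 20.0, 30.0],
})
thresholds = {'Protein': 100, 'sugar_total': 500}

# Build one combined mask instead of filtering column by column.
# `value <= threshold` is False for NaN, so rows with missing nutrients
# are dropped as well -- the same behaviour as the loop above.
mask = pd.Series(True, index=df.index)
for col, threshold in thresholds.items():
    mask &= df[col] <= threshold
df_filtered = df[mask]
print("Rows kept:", len(df_filtered))
```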
# creating a DataFrame category count for export
# (value_counts() is sorted by frequency, not by unique() order, so build
# both columns from the same call to keep labels and counts aligned)
df_category_count = df_cleaned['category'].value_counts().rename_axis('category').reset_index(name='count')
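One caveat with building such a frame: `value_counts()` sorts labels by frequency, which generally differs from the order of `unique()`, so pairing the two can misalign categories and counts. Deriving both columns from the same `value_counts()` call avoids that. A sketch on toy labels:

```python
import pandas as pd

s = pd.Series(['bread', 'cakes', 'bread', 'bread', 'cakes', 'baby-food'])
# value_counts() sorts by frequency; rename_axis + reset_index turns the
# counted Series into a two-column frame with label and count side by side
counts = s.value_counts().rename_axis('category').reset_index(name='count')
print(counts)
```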
# Filename for CSV export
file_name = 'category_count_product'
# Concatenate the path and filename
file_path = os.path.join(save_cvs_path, f"{file_name}.csv")
# Export the DataFrame to CSV
df_category_count.to_csv(file_path, index=False)
from collections import Counter
# Assuming you have already removed unnecessary characters and spaces from the 'ingredients' column
cleaned_ingredients = [ingredient.strip(" '()[]").lower() for ingredients in df_cleaned['ingredients'] if isinstance(ingredients, str) for ingredient in ingredients.split(',')]
# Count the occurrences of each unique ingredient
ingredient_counts = Counter(cleaned_ingredients)
# Get the total number of unique ingredients
total_unique_ingredients = len(set(cleaned_ingredients))
# Display the total number of unique ingredients
print(f"Total number of unique ingredients: {total_unique_ingredients}")
# Display the top 10 ingredients and their counts
top_10_ingredients = ingredient_counts.most_common(10)
for i, (ingredient, count) in enumerate(top_10_ingredients, start=1):
print(f"{i}. '{ingredient}': {count} times")
Total number of unique ingredients: 44484 1. 'salt': 24078 times 2. 'water': 12088 times 3. 'sugar': 10770 times 4. 'sea salt': 5906 times 5. 'citric acid': 5042 times 6. 'folic acid': 4960 times 7. 'enzymes': 4741 times 8. 'niacin': 4711 times 9. 'natural flavor': 4605 times 10. 'soy lecithin': 4019 times
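Splitting the ingredient strings on commas breaks compound ingredients apart, which inflates the unique-ingredient count (visible later in fragments such as "peanut butter {peanuts"). Since the column stores stringified Python lists, `ast.literal_eval` can recover each list intact; a sketch on toy entries (the sample strings are assumptions, not rows from the dataset):

```python
import ast
from collections import Counter

# Toy stand-ins for the stringified lists in df_cleaned['ingredients']
raw = [
    "['Sugar', 'Salt', 'Natural Flavor (celery, onion)']",
    "['Water', 'Sugar']",
]

counts = Counter()
for entry in raw:
    try:
        # Parse the string back into a real Python list; this keeps
        # 'natural flavor (celery, onion)' as one ingredient instead of
        # splitting it at the inner comma
        items = ast.literal_eval(entry)
    except (ValueError, SyntaxError):
        # fall back to a plain comma split for malformed entries
        items = entry.strip('[]').split(',')
    counts.update(item.strip(" '").lower() for item in items)

print(counts.most_common(3))
```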
cleaned_ingredients = [ingredient.strip(" '()[]").lower() for ingredients in df_cleaned['ingredients'] if isinstance(ingredients, str) for ingredient in ingredients.split(',')]
# Count the occurrences of each unique ingredient
ingredient_counts = Counter(cleaned_ingredients)
# Create a DataFrame from the Counter object
df_ingredient_counts = pd.DataFrame(ingredient_counts.items(), columns=['Ingredient', 'Count'])
# Display the resulting DataFrame
print(df_ingredient_counts)
Ingredient Count
0 cultured pasteurized grade a lowfat milk 1
1 sugar 10770
2 casein 24
3 whey protein 61
4 raspberry puree concentrate 3
... ... ...
44479 peanut butter {peanuts 1
44480 hydrogenated rapeseed and cottonseed oil} 1
44481 pretzel balls [enriched wheat flour {wheat flour 1
44482 vegetable oil {may contain one or more of the ... 1
44483 mini peanut butter cups (milk chocolate [sugar 1
[44484 rows x 2 columns]
df_ingredient_counts
| Ingredient | Count | |
|---|---|---|
| 0 | cultured pasteurized grade a lowfat milk | 1 |
| 1 | sugar | 10770 |
| 2 | casein | 24 |
| 3 | whey protein | 61 |
| 4 | raspberry puree concentrate | 3 |
| ... | ... | ... |
| 44479 | peanut butter {peanuts | 1 |
| 44480 | hydrogenated rapeseed and cottonseed oil} | 1 |
| 44481 | pretzel balls [enriched wheat flour {wheat flour | 1 |
| 44482 | vegetable oil {may contain one or more of the ... | 1 |
| 44483 | mini peanut butter cups (milk chocolate [sugar | 1 |
44484 rows × 2 columns
# Filename for CSV export
file_name = 'ingredient_counts_unique'
# Concatenate the path and filename
file_path = os.path.join(save_cvs_path, f"{file_name}.csv")
# Export the DataFrame to CSV
df_ingredient_counts.to_csv(file_path, index=False)
df_cleaned['ingredients'] = df_cleaned['ingredients'].str.strip()
df_cleaned['ingredients'] = df_cleaned['ingredients'].str.lower()
# Using unique()
unique_ingredients_direct = df_cleaned['ingredients'].unique()
# Using set and split
unique_ingredients_set = set()
for ingredients in df_cleaned['ingredients']:
if isinstance(ingredients, str):
unique_ingredients_set.update(ingredients.split(','))
# Display some examples
print("Examples from unique():", unique_ingredients_direct[:5])
print("\nExamples from set and split:", list(unique_ingredients_set)[:5])
Examples from unique(): ["['cultured pasteurized grade a lowfat milk', 'sugar', 'casein', 'whey protein', 'raspberry puree concentrate', 'blueberry puree concentrate', 'strawberry puree concentrate', 'kosher gelatin', 'native corn starch', 'pectin', 'fruit and vegetable juice concentrate (for color)', 'natural flavor', 'locust bean gum', 'vitamin a acetate', 'vitamin d3', 'tricalcium phosphate.']" "['our family recipe: cultured pasteurized organic whole milk, organic cane sugar, organic peach puree, organic oat flour, pectin, organic natural flavor, organic annatto extract and organic fruit juice concentrate (color), vitamin d3.5 live cultures: s. thermophilus, l. bulgaricus, l. acidophilus, bifidus and l. paracasei']" "['cultured pasteurized organic whole milk', 'organic cane sugar', 'organic banana puree', 'organic oat flour', 'pectin', 'natural flavor', 'organic fruit and vegetable juice concentrates (color)', 'vitamin d3']" "['almond milk (water, almonds)', 'cane sugar', 'strawberries', 'tapioca starch', 'bananas', 'natural flavor', 'fruit and vegetable juice [color]', 'locust bean gum', 'xanthan gum', 'citric acid', 'live active cultures: s. thermophilus, l. bulgaricus, l. 
acidophilus, bifidobacteria']" "['crust: whole grain wheat flour**, wheat flour**, evaporated cane syrup**, oats**, sunflower oil** and/or canola oil**, concentrated apple juice**, chia**, vanilla extract**, apple powder** (apple**, rice flour**, sunflower lecithin), baking powder (sodium acid pyrophosphate, rice starch**, sodium bicarbonate), sea salt, baking soda.', 'filling: cane syrup**, brown rice syrup**, dried cane syrup**, apple powder**, cornstarch**, yogurt powder** (nonfat dry milk**, lactic acid), strawberries**, spinach**, glycerin, pectin, natural flavor, citric acid, red cabbage extract (for color).', 'vitamins and minerals: vitamin a acetate, tricalcium phosphate, ascorbic acid, ferrous fumarate (iron), zinc gluconate, vitamin e acetate, niacinamide (vitamin b3), cyanocobalamine (vitamin b12), riboflavin (vitamin b2), thiamine hydrochloride (vitamin b1), pyridoxine hydrochloride (vitamin b6).', '**organic']"] Examples from set and split: [" mint*.'", ' agave syrup', " 'soup packet: shoyu powder (soybeans", " 'and guar gums])'", " 'orange preparation pieces [orange'"]
# Remove unnecessary characters and spaces
## df_cleaned['ingredients'] = df_cleaned['ingredients'].str.replace('[\[\]\'\s]', '', regex=True)
# Obtain the count of unique ingredients
unique_ingredients_count = len(df_cleaned['ingredients'].unique())
print(f"Number of unique ingredients after preprocessing: {unique_ingredients_count}")
Number of unique ingredients after preprocessing: 24412
# Add an 'Organic' column based on the presence of 'Organic' in the ingredient name
df_ingredient_counts['Organic'] = df_ingredient_counts['Ingredient'].str.contains('organic', case=False)
# Display the resulting DataFrame
print(df_ingredient_counts)
Ingredient Count Organic
0 cultured pasteurized grade a lowfat milk 1 False
1 sugar 10770 False
2 casein 24 False
3 whey protein 61 False
4 raspberry puree concentrate 3 False
... ... ... ...
44479 peanut butter {peanuts 1 False
44480 hydrogenated rapeseed and cottonseed oil} 1 False
44481 pretzel balls [enriched wheat flour {wheat flour 1 False
44482 vegetable oil {may contain one or more of the ... 1 False
44483 mini peanut butter cups (milk chocolate [sugar 1 False
[44484 rows x 3 columns]
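`str.contains(..., case=False)` flags any casing of "organic"; a minimal check on toy ingredient names (the sample values are assumptions):

```python
import pandas as pd

# Toy ingredient names (hypothetical)
df = pd.DataFrame({'Ingredient': ['organic cane sugar', 'sugar', 'Organic Oats']})
# case=False makes the match case-insensitive; if the column contained
# missing values, na=False would be needed to avoid NaN in the result
df['Organic'] = df['Ingredient'].str.contains('organic', case=False)
print(df['Organic'].value_counts())
```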
df_ingredient_counts
| Ingredient | Count | Organic | |
|---|---|---|---|
| 0 | cultured pasteurized grade a lowfat milk | 1 | False |
| 1 | sugar | 10770 | False |
| 2 | casein | 24 | False |
| 3 | whey protein | 61 | False |
| 4 | raspberry puree concentrate | 3 | False |
| ... | ... | ... | ... |
| 44479 | peanut butter {peanuts | 1 | False |
| 44480 | hydrogenated rapeseed and cottonseed oil} | 1 | False |
| 44481 | pretzel balls [enriched wheat flour {wheat flour | 1 | False |
| 44482 | vegetable oil {may contain one or more of the ... | 1 | False |
| 44483 | mini peanut butter cups (milk chocolate [sugar | 1 | False |
44484 rows × 3 columns
# Filter out rows with counts less than 5
df_filtered_ingredients = df_ingredient_counts[df_ingredient_counts['Count'] >= 5]
df_filtered_ingredients
| Ingredient | Count | Organic | |
|---|---|---|---|
| 1 | sugar | 10770 | False |
| 2 | casein | 24 | False |
| 3 | whey protein | 61 | False |
| 5 | blueberry puree concentrate | 12 | False |
| 6 | strawberry puree concentrate | 24 | False |
| ... | ... | ... | ... |
| 44033 | ingredients: organic maltodextrin | 5 | True |
| 44178 | tricalcium phosphate (anti-caking agent | 5 | False |
| 44224 | ingredients: pure cane sugar | 5 | False |
| 44226 | organic powdered sugar (organic cane sugar | 19 | True |
| 44418 | salt. contains: peanuts. | 6 | False |
7094 rows × 3 columns
# Count occurrences of 'Organic'
df_filtered_ingredients_organic = df_filtered_ingredients['Organic'].value_counts()
df_filtered_ingredients_organic
Organic False 6155 True 939 Name: count, dtype: int64
# Filename for CSV export
file_name = 'filtered_ingredients_organic'
# Concatenate the path and filename
file_path = os.path.join(save_cvs_path, f"{file_name}.csv")
# Export the DataFrame to CSV
df_filtered_ingredients.to_csv(file_path, index=False)
# Define a list of columns to exclude from the correlation analysis
columns_to_exclude = [ "original_ID", "product", "store", "category", "brand", "ingredients", "f_min_FPro"]
# Create a new DataFrame by dropping the specified string columns
numeric_columns = df_cleaned.drop(columns=columns_to_exclude)
# Convert non-numeric columns to numeric, replacing non-convertible values with NaN
numeric_columns = numeric_columns.apply(pd.to_numeric, errors='coerce')
# Drop columns with all NaN values
numeric_columns = numeric_columns.dropna(axis=1, how='all')
# Calculate the correlation matrix
correlation_matrix = numeric_columns.corr()
fig = px.imshow(correlation_matrix, text_auto='.2f', height=1000)
fig.show()
A coefficient close to 1 indicates a strong positive relationship.
A coefficient close to 0 indicates no linear relationship.
A coefficient close to -1 indicates a strong negative (inverse) relationship.
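For intuition, Pearson's r on small constructed columns (the column names here are illustrative, not from the dataset):

```python
import pandas as pd

# Toy frame illustrating the extreme cases
df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})
df['pos'] = df['x'] * 2   # moves exactly with x    -> r =  1
df['neg'] = -df['x']      # moves exactly against x -> r = -1
corr = df.corr()
print(corr.round(2))
```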
from datetime import datetime
# Generate a unique image name using the current date and time
image_name = f"correlation_matrix_{datetime.now().strftime('%Y%m%d_%H%M%S')}.png"
# Combine the directory and image name
image_path = os.path.join(save_image_path, image_name)
# Export as image (PNG)
fig.write_image(image_path, engine="orca")
fig = px.scatter(df_cleaned, y="ingredient_total", x="sugar_total", height=500)
fig.show()
numeric_columns = df_cleaned.select_dtypes(include='number').columns
# Exclude specific columns
columns_to_exclude = ['original_ID',
'product',
'store',
'category',
'brand',
'f_FPro',
'f_min_FPro',
'f_std_FPro',
'f_FPro_class',
'ingredients',
'has10_nuts',
'is_Nuts_Converted_100g',
'Sodium',
'Iron'
]
selected_dimensions = [col for col in numeric_columns if col not in columns_to_exclude]
# Create scatterplot matrix
## fig = px.scatter_matrix(df_cleaned, dimensions=selected_dimensions, title="Scatterplot Matrix")
## fig.update_layout(height=1500, width=1500)
## fig.show()
# Create scatterplot matrix
fig = px.scatter_matrix(df_cleaned, dimensions=selected_dimensions, title="Scatterplot Matrix")
fig.update_layout(height=1200, width=1200)
fig.show()
# Generate a unique image name using the current date and time
image_name = f"Scatterplot Matrix_{datetime.now().strftime('%Y%m%d_%H%M%S')}.png"
# Combine the directory and image name
image_path = os.path.join(save_image_path, image_name)
# Export as image (PNG)
fig.write_image(image_path, engine="orca")
df_cleaned.info()
<class 'pandas.core.frame.DataFrame'> Index: 27543 entries, 0 to 50458 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 original_ID 27543 non-null object 1 product 27543 non-null object 2 store 27543 non-null object 3 category 27543 non-null object 4 brand 27543 non-null object 5 f_FPro 26186 non-null float64 6 f_min_FPro 26152 non-null float64 7 f_std_FPro 26152 non-null float64 8 f_FPro_class 26186 non-null float64 9 ingredients 27543 non-null object 10 has10_nuts 27543 non-null int64 11 is_Nuts_Converted_100g 27543 non-null int64 12 Protein 27543 non-null float64 13 Total Fat 27543 non-null float64 14 Carbohydrate 27543 non-null float64 15 sugar_total 27543 non-null float64 16 Fiber, total dietary 27543 non-null float64 17 Calcium 26669 non-null float64 18 Iron 26516 non-null float64 19 Sodium 27543 non-null float64 20 Vitamin C 13041 non-null float64 21 Cholesterol 27319 non-null float64 22 Fatty acids, total saturated 27543 non-null float64 23 Total Vitamin A 13025 non-null float64 dtypes: float64(16), int64(2), object(6) memory usage: 5.3+ MB
# checking distributions for outliers
fig = px.histogram(df_cleaned, x='category', nbins=20, title='Distribution of Category')
fig.show()
# Generate a unique image name using the current date and time
image_name = f"Histogramm_category{datetime.now().strftime('%Y%m%d_%H%M%S')}.png"
# Combine the directory and image name
image_path = os.path.join(save_image_path, image_name)
# Export as image (PNG)
fig.write_image(image_path, engine="orca")
fig = px.histogram(df_cleaned, x='Total Fat', nbins=30, title='Distribution of Total Fat')
fig.show()
# Generate a unique image name using the current date and time
image_name = f"Histogramm_Total_Fat{datetime.now().strftime('%Y%m%d_%H%M%S')}.png"
# Combine the directory and image name
image_path = os.path.join(save_image_path, image_name)
# Export as image (PNG)
fig.write_image(image_path, engine="orca")
fig = px.histogram(df_cleaned, x='Fiber, total dietary', nbins=30, title='Distribution of Fiber, total dietary')
fig.show()
# Generate a unique image name using the current date and time
image_name = f"Histogramm_Fiber{datetime.now().strftime('%Y%m%d_%H%M%S')}.png"
# Combine the directory and image name
image_path = os.path.join(save_image_path, image_name)
# Export as image (PNG)
fig.write_image(image_path, engine="orca")
fig = px.histogram(df_cleaned, x='Carbohydrate', nbins=30, title='Distribution of Carbohydrate')
fig.show()
# Generate a unique image name using the current date and time
image_name = f"Histogramm_Carbohydrate{datetime.now().strftime('%Y%m%d_%H%M%S')}.png"
# Combine the directory and image name
image_path = os.path.join(save_image_path, image_name)
# Export as image (PNG)
fig.write_image(image_path, engine="orca")
# Print the path for reference
print(f"Image saved at: {image_path}")
Image saved at: /Users/franziska/Documents/__CareerFoundry_Data Analyst/6_Exercise/Project_Food/04 Analysis/Visualizations/Histogramm_Carbohydrate20231222_134102.png
# Exclude non-numeric columns (you may customize this based on your DataFrame)
numeric_columns = df_cleaned.select_dtypes(include='number').columns
# Create a scatter plot for each numeric column
for column in numeric_columns:
fig = px.scatter(df_cleaned, y=column, x="sugar_total", height=500, title=f"Scatter Plot: {column} vs sugar_total")
fig.show()
# Generate a unique image name using the current date and time
image_name = f"Scatterplot Matrix_{datetime.now().strftime('%Y%m%d_%H%M%S')}.png"
# Combine the directory and image name
image_path = os.path.join(save_image_path, image_name)
# Export as image (PNG)
fig.write_image(image_path, engine="orca")
# create new column sugar_sum: the total sugar_total per category, broadcast to every row
df_cleaned['sugar_sum'] = df_cleaned.groupby(['category'])['sugar_total'].transform('sum')
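`groupby(...).transform('sum')` returns a series aligned to the original rows rather than one row per group, which is what allows it to be assigned straight back as a column. A toy illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['bread', 'bread', 'cakes'],
    'sugar_total': [4.0, 6.0, 20.0],
})
# transform('sum') broadcasts each category's total back onto every row,
# so the result keeps the frame's original length (agg/sum would instead
# collapse to one row per group)
df['sugar_sum'] = df.groupby('category')['sugar_total'].transform('sum')
print(df)
```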
fig = px.sunburst(df_cleaned, color="sugar_sum", path=["category"], hover_name="brand", height=500)
fig.show()
# Generate a unique image name using the current date and time
image_name = f"sunburst_sugar{datetime.now().strftime('%Y%m%d_%H%M%S')}.png"
# Combine the directory and image name
image_path = os.path.join(save_image_path, image_name)
# Export as image (PNG)
fig.write_image(image_path, engine="orca")
# Print the path for reference
print(f"Image saved at: {image_path}")
Image saved at: /Users/franziska/Documents/__CareerFoundry_Data Analyst/6_Exercise/Project_Food/04 Analysis/Visualizations/sunburst_sugar20231222_134106.png
# creating a DataFrame df_sugar_count for export
# (sum sugar_total per category directly; summing the broadcast sugar_sum
# column would multiply each total by the category's row count, and pairing
# unique() with a separately sorted aggregation can misalign the labels)
df_sugar_count = df_cleaned.groupby('category')['sugar_total'].sum().reset_index(name='sum')
# Round the 'sum' column to a specified number of decimal places (e.g., 2)
df_sugar_count['sum'] = df_sugar_count['sum'].round(2)
# Filename for CSV export
file_name = 'category_sugar_count'
# Concatenate the path and filename
file_path = os.path.join(save_cvs_path, f"{file_name}.csv")
# Export the DataFrame to CSV
df_sugar_count.to_csv(file_path, index=False)
df_cleaned['sugar_total'].head()
0 14.000000 1 9.375000 2 9.375000 4 12.000000 5 26.315789 Name: sugar_total, dtype: float64
# Create a scatter plot with hover information
fig = px.scatter(df_cleaned, x='sugar_total', title='Scatter Plot of Sugar Total', hover_data=['original_ID'])
# Show the plot
fig.show()
# Filter the DataFrame
high_sugar_products = df_cleaned[df_cleaned['sugar_total'] > 99]
# Display the list of products
print(high_sugar_products[['original_ID', 'product', 'sugar_total']])
original_ID \
157 wf_365-by-whole-foods-market-green-decorating-...
172 wf_big-tree-farms-organic-coconut-sugar-5472ae
604 wm_10315012
2631 wm_233352116
6676 wm_49913335
... ...
45318 wm_160196574
46068 wm_108460286
46174 wm_991636575
46657 wm_12444630
47572 wm_22282328
product sugar_total
157 Green Decorating Sugar, 3.3 oz 100.000000
172 Organic Coconut Sugar, 14 oz 100.000000
604 Great Value Light Brown Sugar, 2 Lb 200.000000
2631 Freshness Guaranteed Variety Cheesecake, 40 oz... 223.404255
6676 Goya Maria Cookies, 3.5 oz 120.000000
... ... ...
45318 Reese's Take5, Pretzels, Chocolate, Caramel, K... 128.571429
46068 Nubu Nut Butter Peanut Bites, 6 oz. 216.666667
46174 Nubu Nut Butter Cashew Bites, 5.5 oz. 216.666667
46657 Kernel Season's Kettle Corn Popcorn Seasoning,... 400.000000
47572 Kitchen Basics Original Seafood Stock, 32 fl oz 100.000000
[105 rows x 3 columns]
df_cleaned['sugar_total'].describe()
count 27543.000000 mean 12.241035 std 19.655263 min 0.000000 25% 0.675676 50% 3.846154 75% 17.293233 max 500.000000 Name: sugar_total, dtype: float64
# Create a box plot for 'sugar_total'
fig = px.box(df_cleaned, y='sugar_total', title='Box Plot of Sugar Total')
# Show the plot
fig.show()
# Generate a unique image name using the current date and time
image_name = f"boxplot_sugar{datetime.now().strftime('%Y%m%d_%H%M%S')}.png"
# Combine the directory and image name
image_path = os.path.join(save_image_path, image_name)
# Export as image (PNG)
fig.write_image(image_path, engine="orca")
# count occurrences of the substring 'salt' in each product's ingredient list
# (no groupby needed: str.count operates row by row)
df_cleaned['salt_sum'] = df_cleaned['ingredients'].str.count('salt')
df_cleaned['salt_sum']
0 0
1 0
2 0
4 0
5 1
..
50421 1
50423 0
50433 7
50448 1
50458 1
Name: salt_sum, Length: 27543, dtype: int64
## df_cleaned[df_cleaned['salt_sum'] == 7]
# one ingredient list where the product has salt listed 7 times
ingredients_of_row_with_7 = df_cleaned.loc[df_cleaned['salt_sum'] == 7, 'ingredients'].values[0]
# parse the stringified list with ast.literal_eval rather than eval()
import ast
ingredient_list = ast.literal_eval(ingredients_of_row_with_7)
# Print the entire list of ingredients
print(ingredient_list)
['bread: enriched wheat flour (flour', ' ascorbic acid added as a dough conditioner', ' niacin', ' reduced iron', ' thiamine mononitrate [vitamin b1]', ' riboflavin [vitamin b2]', ' folic acid)', ' water', ' sugar', ' yeast', ' salt', ' soybean oil', ' whey', ' dough conditioner blend (dextrose', ' wheat flour', ' sodium stearoyl lactylate', ' vegetable oil [canola and/or soy oil]', ' amylase', ' ascorbic acid', ' l-cysteine monohydrochloride)', ' natural flavor', ' cultured dextrose (as a preservative)', ' microbial enzymes', ' cheese blend: mozzarella cheese (cultured pasteurized reduced fat milk', ' salt', ' enzymes)', ' provolone cheese with natural smoke flavor (cultured pasteurized milk', ' salt', ' enzymes', ' natural smoke flavor)', ' parmesan cheese (cultured pasteurized milk', ' salt', ' enzymes)', ' powdered cellulose (to prevent caking)', ' spread: margarine (palm oil', ' soy oil', ' water', ' salt', ' soy lecithin', ' monoglycerides', ' potassium sorbate as a preservative', ' natural flavor', ' vitamin a palmitate', ' vitamin d3', ' beta-carotene [color])', ' soybean oil', ' dehydrated garlic', ' parmesan cheese (pasteurized partly skimmed milk', ' salt', ' bacterial culture', ' lipase', ' microbial enzyme)', ' natural flavor', ' salt', ' dehydrated parsley', ' citric acid (as a preservative)', ' potassium sorbate (as a preservative)', ' beta-carotene (color)']
fig = px.sunburst(df_cleaned, color="salt_sum", path=["category"], hover_name="brand", height=500)
fig.show()
# Creates a new column counting the word "salt" in each product's ingredients list
import re
# Defines a regex pattern matching "salt" as a whole word (this also catches the "salt" in "sea salt")
salt_pattern = re.compile(r'\bsalt\b', flags=re.IGNORECASE)
# Applies the pattern to the 'ingredients' column
df_cleaned['word_salt_count'] = df_cleaned['ingredients'].apply(lambda x: len(salt_pattern.findall(str(x))))
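The counting logic can be verified on a few invented ingredient strings (these are not rows from the dataset):

```python
import re
import pandas as pd

# Whole-word, case-insensitive match; also catches the "salt" inside "sea salt"
salt_pattern = re.compile(r'\bsalt\b', flags=re.IGNORECASE)

# Made-up ingredient strings for illustration only
sample = pd.Series([
    "water, salt, yeast",
    "sea salt, sugar, Salt",
    "sugar, cocoa",
])
counts = sample.apply(lambda x: len(salt_pattern.findall(x)))
print(counts.tolist())  # [1, 2, 0]
```

The word boundary `\b` prevents false positives on words like "basalt" or "salted" being counted as exact "salt" tokens.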
# Counts the products in each category and maps the count back onto every row
category_counts = df_cleaned['category'].value_counts()
df_cleaned['category_counts'] = df_cleaned['category'].map(category_counts)
result_table = df_cleaned.groupby(['category']).agg({'word_salt_count': 'sum'}).reset_index()
result_table['products_in_category'] = df_cleaned.groupby('category')['product'].nunique().loc[result_table['category']].values
result_table.head(5)
|  | category | word_salt_count | products_in_category |
|---|---|---|---|
| 0 | baby-food | 6 | 124 |
| 1 | baking | 212 | 387 |
| 2 | bread | 800 | 697 |
| 3 | breakfast | 584 | 293 |
| 4 | cakes | 698 | 415 |
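The two-step aggregation above (summing salt counts, then aligning product counts via `.loc`) can be collapsed into one pass with pandas named aggregation; the toy frame below is invented, standing in for `df_cleaned`:

```python
import pandas as pd

# Toy stand-in for df_cleaned with only the columns the aggregation needs
df = pd.DataFrame({
    'category': ['bread', 'bread', 'baking'],
    'product':  ['loaf', 'bun', 'mix'],
    'word_salt_count': [3, 2, 1],
})

# Named aggregation builds both result columns in a single groupby,
# avoiding the manual index alignment used above
result = df.groupby('category', as_index=False).agg(
    word_salt_count=('word_salt_count', 'sum'),
    products_in_category=('product', 'nunique'),
)
print(result)
```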
fig = px.sunburst(df_cleaned, path=['category'], values='word_salt_count')
fig.update_layout(width=1000, height=600)
# Generate a unique image name using the current date and time
image_name = f"sunburst_salt{datetime.now().strftime('%Y%m%d_%H%M%S')}.png"
# Combine the directory and image name
image_path = os.path.join(save_image_path, image_name)
# Export as image (PNG)
fig.write_image(image_path, engine="orca")
# Filename for CSV export
file_name = 'category_salt_count_products'
# Concatenate the path and filename
file_path = os.path.join(save_cvs_path, f"{file_name}.csv")
# Export the DataFrame to CSV
result_table.to_csv(file_path, index=False)
# count all ingredients for each product
df_cleaned['ingredients'] = df_cleaned['ingredients'].astype(str) # Convert to string in case there are non-string values
# Create the ingredient_count column
df_cleaned['ingredient_total'] = df_cleaned['ingredients'].apply(lambda x: len(x.split(',')))
# Display the counts
ingredient_counts = df_cleaned.groupby('product')['ingredient_total'].max().reset_index()
print(ingredient_counts)
                                                 product  ingredient_total
0                     "World's Best" Mac & Cheese, 20 oz                39
1      (1 Can) Red Bull Energy Drink, Cranberry, 8.4 ...                14
2      (12 Bottles) Lipton Diet Green Iced Tea with C...                12
3      (12 Bottles) Lipton Diet Mixed Berry Green Tea...                12
4        (12 Bottles) Lipton Iced Tea, Lemon, 16.9 fl oz                11
...                                                  ...               ...
27065                           unbagels plain, 12.35 oz                13
27066                                     unbread, 20 oz                12
27067  vitaminwater zero sugar squeezed, electrolyte ...                19
27068                       ® Uncured Italian Dry Salami                 8
27069                          ™ Original Crostini, 6 oz                 4

[27070 rows x 2 columns]
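One caveat of counting ingredients by splitting on commas: commas inside parenthesized sub-ingredients are counted as well, so compound entries inflate the total. A small illustration on invented strings:

```python
import pandas as pd

ingredients = pd.Series([
    "water, sugar, salt",
    "flour (wheat, malted barley), water",  # the comma inside the parentheses is counted too
])
totals = ingredients.apply(lambda x: len(x.split(',')))
print(totals.tolist())  # [3, 3]
```

Both strings score 3, even though the second lists only two top-level ingredients; `ingredient_total` is therefore best read as a count of comma-separated tokens rather than distinct ingredients.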
# Create a new column 'processing_category' based on the number of ingredients
df_cleaned['processing_category'] = pd.cut(df_cleaned['ingredient_total'],
bins=[0, 5, 10, 15, float('inf')],
labels=['minimal processed', 'processed', 'high processed', 'ultra processed'],
right=False)
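The binning can be checked on a handful of invented counts. With `right=False` the intervals are left-closed: `[0, 5)`, `[5, 10)`, `[10, 15)`, `[15, inf)`:

```python
import pandas as pd

totals = pd.Series([3, 7, 12, 40])  # invented ingredient counts, one per bin
labels = ['minimal processed', 'processed', 'high processed', 'ultra processed']

# right=False makes each bin include its left edge and exclude its right edge
cats = pd.cut(totals, bins=[0, 5, 10, 15, float('inf')], labels=labels, right=False)
print(cats.tolist())
# ['minimal processed', 'processed', 'high processed', 'ultra processed']
```

So a product with exactly 5 ingredients falls into "processed", not "minimal processed".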
# Display the updated DataFrame
## print(df_cleaned[['product', 'ingredient_total', 'processing_category']])
category_processing = df_cleaned[['category', 'processing_category']].value_counts().reset_index()
# Filename for CSV export
file_name = 'category_processing_category'
# Concatenate the path and filename
file_path = os.path.join(save_cvs_path, f"{file_name}.csv")
# Export the DataFrame to CSV
category_processing.to_csv(file_path, index=False)
category_counts = df_cleaned['processing_category'].value_counts().reset_index()
category_counts.columns = ['processing_category', 'count']
# Display the count of each processing category
print(category_counts)
  processing_category  count
0     ultra processed  12365
1   minimal processed   5673
2           processed   4987
3      high processed   4518
category_counts = pd.crosstab(df_cleaned['category'], df_cleaned['processing_category']).stack().reset_index(name='product_count')
category_counts
|  | category | processing_category | product_count |
|---|---|---|---|
| 0 | baby-food | minimal processed | 30 |
| 1 | baby-food | processed | 68 |
| 2 | baby-food | high processed | 23 |
| 3 | baby-food | ultra processed | 5 |
| 4 | baking | minimal processed | 128 |
| ... | ... | ... | ... |
| 207 | spices-seasoning | ultra processed | 41 |
| 208 | spread-squeeze | minimal processed | 115 |
| 209 | spread-squeeze | processed | 145 |
| 210 | spread-squeeze | high processed | 21 |
| 211 | spread-squeeze | ultra processed | 30 |
212 rows × 3 columns
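`pd.crosstab(...).stack()` keeps the zero cells, which is why every (category, processing level) pair appears in the 212 rows (53 categories × 4 levels). A toy example with invented rows:

```python
import pandas as pd

# Invented rows; crosstab keeps zero cells, so stack() yields every pair
df = pd.DataFrame({
    'category': ['bread', 'bread', 'cakes'],
    'processing_category': ['processed', 'processed', 'ultra processed'],
})
counts = (pd.crosstab(df['category'], df['processing_category'])
            .stack()
            .reset_index(name='product_count'))
print(counts)
```

Even though "bread / ultra processed" never occurs in the toy data, it still shows up with a count of 0, exactly as the zero-count pairs do in the full table above.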
fig = px.scatter(category_counts, x='category', y='processing_category', size='product_count',
title='Scatter Plot of Processing Categories by Count',
labels={'product_count': 'Count'},
color='product_count', hover_name='product_count')
fig.show()
fig = px.scatter(df_cleaned, y="sugar_total", x="word_salt_count", color="processing_category", hover_name="product", height=500)
fig.show()
# Filter the DataFrame for 'ultra processed' category
ultra_processed_df = df_cleaned[df_cleaned['processing_category'] == 'ultra processed']
# Create a sunburst chart
fig = px.sunburst(ultra_processed_df, path=['category'],
title='Sunburst Chart for Ultra Processed Foods',
hover_data=['ingredient_total'],
labels={'product': 'Product'})
# Set a larger size for the chart
fig.update_layout(width=1000, height=600)
fig.show()
# Filter the DataFrame for 'ultra processed' category
processed_df = df_cleaned[df_cleaned['processing_category'] == 'processed']
# Create a sunburst chart
fig = px.sunburst(processed_df, path=['category'],
title='Sunburst Chart for Processed Foods',
hover_data=['ingredient_total'],
labels={'product': 'Product'})
# Set a larger size for the chart
fig.update_layout(width=1000, height=600)
fig.show()
category_counts = df_cleaned['category'].value_counts()
# Calculate the percentage for each category
category_percentages = (category_counts / len(df_cleaned)) * 100
# Round the percentages to two decimal places
category_percentages = category_percentages.round(2)
# Display the result
print(category_percentages)
category
prepared-meals-dishes      7.51
pastry-chocolate-candy     5.94
snacks-bars                5.25
cookies-biscuit            4.66
produce-beans-wf           4.24
produce-packaged           4.14
snacks-mixes-crackers      4.11
cheese                     3.73
snacks-chips               3.62
meat-packaged              3.55
ice-cream-dessert          3.33
dairy-yogurt-drink         3.05
cereal                     2.99
bread                      2.57
sauce-all                  2.50
rolls-buns-wraps           2.43
soup-stew                  2.37
dressings                  2.23
sausage-bacon              1.76
snacks-dips-salsa          1.70
drink-juice                1.67
cakes                      1.51
milk-milk-substitute       1.48
baking                     1.42
seafood                    1.38
drink-shakes-other         1.36
pizza                      1.36
snacks-nuts-seeds          1.33
meat-poultry-wf            1.17
spread-squeeze             1.13
breakfast                  1.07
muffins-bagels             0.95
pasta-noodles              0.94
culinary-ingredients       0.94
drink-coffee               0.93
spices-seasoning           0.91
jerky                      0.88
snacks-popcorn             0.77
mac-cheese                 0.77
drink-soft-energy-mixes    0.77
nuts-seeds-wf              0.73
drink-tea                  0.65
pudding-jello              0.58
salad                      0.49
coffee-beans-wf            0.49
drink-juice-wf             0.46
baby-food                  0.46
rice-grains-packaged       0.45
rice-grains-wf             0.41
eggs-wf                    0.40
seafood-wf                 0.36
drink-water-wf             0.09
canned-goods               0.00
Name: count, dtype: float64
# Create a DataFrame with counts and percentages
category_stats = pd.DataFrame({
'category': category_counts.index,
'count': category_counts.values,
'percentage': category_percentages.values
})
# Display the result
print(category_stats)
                   category  count  percentage
0     prepared-meals-dishes   2068        7.51
1    pastry-chocolate-candy   1637        5.94
2               snacks-bars   1445        5.25
3           cookies-biscuit   1284        4.66
4          produce-beans-wf   1168        4.24
5          produce-packaged   1140        4.14
6     snacks-mixes-crackers   1133        4.11
7                    cheese   1027        3.73
8              snacks-chips    998        3.62
9             meat-packaged    978        3.55
10        ice-cream-dessert    918        3.33
11       dairy-yogurt-drink    840        3.05
12                   cereal    823        2.99
13                    bread    707        2.57
14                sauce-all    689        2.50
15         rolls-buns-wraps    670        2.43
16                soup-stew    654        2.37
17                dressings    613        2.23
18            sausage-bacon    485        1.76
19        snacks-dips-salsa    468        1.70
20              drink-juice    459        1.67
21                    cakes    416        1.51
22     milk-milk-substitute    408        1.48
23                   baking    390        1.42
24                  seafood    381        1.38
25       drink-shakes-other    375        1.36
26                    pizza    374        1.36
27        snacks-nuts-seeds    366        1.33
28          meat-poultry-wf    321        1.17
29           spread-squeeze    311        1.13
30                breakfast    295        1.07
31           muffins-bagels    262        0.95
32            pasta-noodles    259        0.94
33     culinary-ingredients    258        0.94
34             drink-coffee    256        0.93
35         spices-seasoning    250        0.91
36                    jerky    243        0.88
37           snacks-popcorn    213        0.77
38               mac-cheese    212        0.77
39  drink-soft-energy-mixes    211        0.77
40            nuts-seeds-wf    202        0.73
41                drink-tea    179        0.65
42            pudding-jello    159        0.58
43                    salad    136        0.49
44          coffee-beans-wf    135        0.49
45           drink-juice-wf    127        0.46
46                baby-food    126        0.46
47     rice-grains-packaged    125        0.45
48           rice-grains-wf    113        0.41
49                  eggs-wf    111        0.40
50               seafood-wf    100        0.36
51           drink-water-wf     24        0.09
52             canned-goods      1        0.00
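The count-and-percentage construction above can be demonstrated on an invented toy Series (the labels here are placeholders, not real categories):

```python
import pandas as pd

s = pd.Series(['a', 'a', 'a', 'b'])  # invented category labels
counts = s.value_counts()            # sorted descending by count
percentages = (counts / len(s) * 100).round(2)

stats = pd.DataFrame({
    'category': counts.index,
    'count': counts.values,
    'percentage': percentages.values,
})
print(stats)
```

Because `value_counts()` sorts descending, the most frequent category lands in row 0, matching the ordering of the full table above.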
import plotly.express as px
# Assuming df_cleaned is your DataFrame and you want to select numeric columns
numeric_columns = df_cleaned.select_dtypes(include='number')
# Calculate the correlation matrix
correlation_matrix = numeric_columns.corr()
# Create a figure object for the heatmap
fig = px.imshow(correlation_matrix, x=correlation_matrix.index, y=correlation_matrix.columns)
# Customize the heatmap layout and set a larger size (adjust height and width as needed)
fig.update_layout(
    title='Correlation Heatmap',
    height=900,  # Set the height of the plot
    width=900    # Set the width of the plot
)
fig.show()
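The numeric-column selection and correlation step can be checked on an invented frame where the relationships are known in advance (`y` is an exact multiple of `x`, `z` runs opposite to it):

```python
import pandas as pd

# Invented numeric frame: y = 2x exactly, z decreases as x increases
df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2, 4, 6, 8], 'z': [4, 3, 2, 1]})
corr = df.select_dtypes(include='number').corr()
print(corr.round(2))
```

Perfectly linear pairs show a correlation of 1.0 (or -1.0 when inverted), which is the pattern to look for in the heatmap's off-diagonal cells.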
# Filename for CSV export
file_name = 'category_sugar_count'
# Concatenate the path and filename
file_path = os.path.join(save_cvs_path, f"{file_name}.csv")
# Export the DataFrame to CSV
df_sugar_count.to_csv(file_path, index=False)
# Earlier attempt, kept for reference:
# X = df_cleaned['Carbohydrate'].values.reshape(-1,1)
# y = df_cleaned['category_counts'].values.reshape(-1,1)
# Reshape the variables into NumPy arrays and put them into separate objects.
X = df_cleaned['word_salt_count'].values.reshape(-1,1) # the independent variable
y = df_cleaned['ingredient_total'].values.reshape(-1,1)# the dependent variable
X
array([[0],
[0],
[0],
...,
[7],
[1],
[1]])
y
array([[16],
[11],
[ 8],
...,
[84],
[ 5],
[ 8]])
# Split data into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Create a regression object.
regression = LinearRegression() # This is the regression object, which will be fit onto the training set.
# Fit the model to your training data
regression.fit(X_train, y_train)
LinearRegression()
# The predict method
y_predicted = regression.predict(X_test)
fig = px.scatter(x=X_test.flatten(), y=y_test.flatten(), title='Counted Word Salt and the Amount of Ingredients', labels={'x': 'word_salt_count', 'y': 'ingredient_total'})
fig.add_scatter(x=X_test.flatten(), y=y_predicted.flatten(), mode='lines', line=dict(color='red', width=3))
fig.show()
# Create objects that contain the model summary statistics.
mse = mean_squared_error(y_test, y_predicted)  # mean squared error (squared units, not the root)
r2 = r2_score(y_test, y_predicted)  # R2 score
# Model summary statistics.
print('Slope: ', regression.coef_)
print('Mean squared error: ', mse)
print('R2 score: ', r2)
Slope: [[7.93381811]]
Mean squared error: 108.24801034207725
R2 score: 0.5525072795742065
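For a single predictor, the fitted slope can be sanity-checked against the closed-form simple-regression formula, slope = cov(x, y) / var(x). The toy arrays below are invented and lie exactly on a line:

```python
import numpy as np

# Invented toy data lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Closed-form simple OLS: slope from centered cross-products, intercept from the means
slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
intercept = y.mean() - slope * x.mean()
print(slope, intercept)  # 2.0 1.0
```

Running the same formula on `word_salt_count` and `ingredient_total` should reproduce the slope reported by scikit-learn (up to the train/test split).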
from scipy.stats import linregress
# Assuming X and y are your independent and dependent variables
slope, intercept, r_value, p_value, std_err = linregress(X.flatten(), y.flatten())
print(f'P-Value: {p_value}')
P-Value: 0.0
data = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_predicted.flatten()})
data

|  | Actual | Predicted |
|---|---|---|
| 0 | 9 | 6.518841 |
| 1 | 7 | 6.518841 |
| 2 | 18 | 14.452659 |
| 3 | 2 | 6.518841 |
| 4 | 22 | 14.452659 |
| ... | ... | ... |
| 8258 | 4 | 6.518841 |
| 8259 | 11 | 6.518841 |
| 8260 | 12 | 14.452659 |
| 8261 | 15 | 14.452659 |
| 8262 | 7 | 14.452659 |

8263 rows × 2 columns
# checking for outliers
fig = px.histogram(df_cleaned, x='word_salt_count', nbins=20, title='Distribution of Word Salt')
fig.show()
fig = px.histogram(df_cleaned, x='ingredient_total', nbins=30, title='Distribution of Ingredient Total')
fig.show()
# Order the 'ingredient_total' column in descending order
sorted_ingredient_total = df_cleaned['ingredient_total'].sort_values(ascending=False)
# Display the sorted 'ingredient_total' column
print(sorted_ingredient_total)
27754 281
31409 164
2668 157
22337 146
41482 143
...
11921 1
40397 1
40405 1
40408 1
23695 1
Name: ingredient_total, Length: 27543, dtype: int64
unique_categories = df_cleaned['category'].unique()
print(unique_categories)
['baby-food' 'baking' 'bread' 'breakfast' 'cakes' 'canned-goods' 'cereal' 'cheese' 'coffee-beans-wf' 'cookies-biscuit' 'culinary-ingredients' 'dairy-yogurt-drink' 'dressings' 'drink-coffee' 'drink-juice' 'drink-juice-wf' 'drink-shakes-other' 'drink-soft-energy-mixes' 'drink-tea' 'drink-water-wf' 'eggs-wf' 'ice-cream-dessert' 'jerky' 'mac-cheese' 'meat-packaged' 'meat-poultry-wf' 'milk-milk-substitute' 'muffins-bagels' 'nuts-seeds-wf' 'pasta-noodles' 'pastry-chocolate-candy' 'pizza' 'prepared-meals-dishes' 'produce-beans-wf' 'produce-packaged' 'pudding-jello' 'rice-grains-packaged' 'rice-grains-wf' 'rolls-buns-wraps' 'salad' 'sauce-all' 'sausage-bacon' 'seafood' 'seafood-wf' 'snacks-bars' 'snacks-chips' 'snacks-dips-salsa' 'snacks-mixes-crackers' 'snacks-nuts-seeds' 'snacks-popcorn' 'soup-stew' 'spices-seasoning' 'spread-squeeze']
selected_categories = ['baking',
'bread',
'canned-goods',
'cheese',
'culinary-ingredients',
'dressings',
'eggs-wf',
'ice-cream-dessert',
'jerky', 'mac-cheese',
'meat-packaged',
'meat-poultry-wf',
'pasta-noodles',
'pastry-chocolate-candy',
'pizza',
'prepared-meals-dishes',
'produce-beans-wf',
'produce-packaged',
'rice-grains-packaged',
'rice-grains-wf',
'rolls-buns-wraps',
'sauce-all',
'sausage-bacon',
'seafood-wf',
'snacks-bars',
'snacks-chips',
'snacks-dips-salsa',
'snacks-mixes-crackers',
'snacks-nuts-seeds',
'soup-stew',
'spices-seasoning',
'spread-squeeze']
df_cleaned_categories = df_cleaned[df_cleaned['category'].isin(selected_categories)]
# Reshape the variables into NumPy arrays and put them into separate objects.
X = df_cleaned_categories['word_salt_count'].values.reshape(-1,1) # the independent variable
y = df_cleaned_categories['ingredient_total'].values.reshape(-1,1)# the dependent variable
X
array([[0],
[1],
[0],
...,
[7],
[1],
[1]])
y
array([[ 2],
[12],
[ 5],
...,
[84],
[ 5],
[ 8]])
# Split data into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Create a regression object.
regression = LinearRegression() # This is the regression object, which will be fit onto the training set.
# Fit the model to your training data
regression.fit(X_train, y_train)
LinearRegression()
# The predict method
y_predicted = regression.predict(X_test)
fig = px.scatter(x=X_test.flatten(), y=y_test.flatten(), title='Counted Word Salt and the Amount of Ingredients', labels={'x': 'word_salt_count', 'y': 'ingredient_total'})
fig.add_scatter(x=X_test.flatten(), y=y_predicted.flatten(), mode='lines', line=dict(color='red', width=3))
fig.show()
# Create objects that contain the model summary statistics.
mse = mean_squared_error(y_test, y_predicted)  # mean squared error (squared units, not the root)
r2 = r2_score(y_test, y_predicted)  # R2 score
# Model summary statistics.
print('Slope: ', regression.coef_)
print('Mean squared error: ', mse)
print('R2 score: ', r2)
Slope: [[7.83930017]]
Mean squared error: 114.31023648678583
R2 score: 0.5860225677224559
from scipy.stats import linregress
# Assuming X and y are your independent and dependent variables
slope, intercept, r_value, p_value, std_err = linregress(X.flatten(), y.flatten())
print(f'P-Value: {p_value}')
P-Value: 0.0
from scipy.stats import pearsonr
# 'word_salt_count' and 'ingredient_total' are your two columns
correlation_coefficient, p_value = pearsonr(df_cleaned['word_salt_count'], df_cleaned['ingredient_total'])
print(f'Pearson Correlation Coefficient: {correlation_coefficient}')
print(f'P-Value: {p_value}')
Pearson Correlation Coefficient: 0.7610074484256731 P-Value: 0.0
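As an aside, a reported p-value of exactly 0.0 simply means it underflowed double precision; it is "smaller than can be represented", not literally zero. The Pearson coefficient itself can also be computed with `numpy.corrcoef`; the arrays below are invented:

```python
import numpy as np

# Invented arrays; b tracks a closely but not perfectly
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([1.0, 2.0, 3.0, 5.0])

# corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(a, b)[0, 1]
print(round(r, 3))  # 0.983
```

With tens of thousands of rows and r ≈ 0.76, a vanishingly small p-value like the one above is expected.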
num_clusters = range(1, 10) # Defines the range of potential clusters in the data
kmeans = [KMeans(n_clusters=i, n_init=10) for i in num_clusters] # Defines k-means clusters with n_init set explicitly
# Select only the numeric columns for clustering (excluding the country column)
numeric_df = df_cleaned.select_dtypes(include=['float64', 'int'])
# Fill NaN values with 0 in the new numeric DataFrame
numeric_df.fillna(0, inplace=True)
score = [kmeans[i].fit(numeric_df).inertia_ for i in range(len(kmeans))]
# Inertia: the within-cluster sum of squared distances for each cluster count
score
[7879254266571.538, 1523241208235.2236, 559104525481.2881, 201369777738.76828, 105738532368.7413, 65806948331.815384, 31074839197.61143, 16639203528.842346, 11824364338.472075]
# Create a DataFrame to hold the elbow curve data
elbow_data = pd.DataFrame({'NumClusters': num_clusters, 'Score': score})
# Create an interactive elbow curve using Plotly
fig = px.line(elbow_data, x='NumClusters', y='Score', title='Elbow Curve')
fig.update_xaxes(title_text='Number of Clusters')
fig.update_yaxes(title_text='Score')
fig.show()
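The elbow behavior is easiest to see on synthetic data where the true cluster count is known. The blobs below are invented (two tight clusters of 20 points each), so the inertia should collapse between k=1 and k=2 and then flatten:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two invented, well-separated blobs of 20 two-dimensional points each
data = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
                  rng.normal(5.0, 0.1, (20, 2))])

# Inertia for k = 1..4; the sharp drop then plateau marks the elbow at k=2
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
            for k in range(1, 5)]
print([round(i, 2) for i in inertias])
```

On the real `numeric_df` the elbow is less clean, which is typical when the features are on very different scales; standardizing before clustering usually sharpens it.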
import sklearn
print(sklearn.__version__)
1.3.0
import dask.array as da
from dask_ml.cluster import KMeans
# Defines chunk size based on your dataset and memory constraints
chunk_size = 1000
# Convert your Pandas DataFrame to a Dask Array
dask_data = da.from_array(numeric_df.values, chunks=chunk_size)
# Create a Dask k-means model
kmeans = KMeans(n_clusters=4, init_max_iter=20, oversampling_factor=10)
# Fit the model
kmeans.fit(dask_data)
# Predict cluster labels
labels = kmeans.labels_.compute()
from sklearn.cluster import KMeans  # re-import: the dask_ml import above shadows sklearn's KMeans
kmeans = KMeans(n_clusters=2, n_init=10)  # n_jobs was removed from KMeans in scikit-learn 0.25; parallelism is automatic
# Fit the k-means object to the data.
kmeans.fit(numeric_df)
KMeans(n_clusters=2, n_init=10)
numeric_df['clusters'] = kmeans.predict(numeric_df )
numeric_df['clusters'].value_counts()
clusters
0    23177
1     4366
Name: count, dtype: int64
numeric_df['word_salt_count']
fig = px.scatter(numeric_df, x='word_salt_count', y='ingredient_total', color='clusters', size_max=10, hover_name='category_counts')
fig.update_layout(title='Clustered Data Plot', xaxis_title='Word Salt Count', yaxis_title='ingredient_total')
fig.show()
fig = px.scatter(numeric_df, x='category_counts', y='ingredient_total', color='clusters', size_max=10, hover_name='category_counts')
fig.update_layout(title='Clustered Data Plot', xaxis_title='Category Counts', yaxis_title='Ingredient Total')
fig.show()
# Carry the salt count over from df_cleaned so it can serve as an axis in the 3D plot
numeric_df['word_salt_count'] = df_cleaned['word_salt_count']
fig = px.scatter_3d(
    numeric_df,
    x='word_salt_count',
    y='ingredient_total',
    z='category_counts',
    color='clusters',  # color by the k-means cluster labels
    size_max=10,
    hover_name='word_salt_count'
)
fig.update_layout(
    title='3D Clustered Data Plot for Salt',
    scene=dict(xaxis_title='Word Salt Count', yaxis_title='Ingredient Total', zaxis_title='Category Counts'),
    width=800,
    height=600
)
fig.show()
# Filename for CSV export
file_name = 'df_cleaned_end'
# Concatenate the path and filename
file_path = os.path.join(save_cvs_path, f"{file_name}.csv")
# Export the DataFrame to CSV
df_cleaned.to_csv(file_path, index=False)
df_cleaned.info()